GH-41834: [R] Better error handling in dplyr code #41576

nealrichardson · 2024-05-07T15:35:27Z

I started out trying to make it so that arrow_eval() could just raise its errors, rather than catch them and have every caller inspect and re-raise. I kept pulling on that thread and ended up refactoring most of the error handling in the dplyr code paths. Summary of changes, from the bottom up:

We have two wrappers that raise classed errors: arrow_not_supported() (which previously existed but just called stop()) and validation_error(). They raise arrow_not_supported and validation_error, respectively. Function bindings now raise one or the other, never just stop/abort.
arrow_eval() modifies the errors raised by function bindings: it inserts the expression as the call attribute of the error, which lets rlang print it more cleanly, and it catches any non-classed errors and re-raises them as arrow_not_supported or validation_error, as appropriate.
New try_arrow_dplyr() wrapper around everything inside (most*) dplyr verb implementations, which calls abandon_ship() only on arrow_not_supported errors and lets all other errors just raise. For datasets, it just adds a note to the error message advising you that you can call collect(). So errors generally bubble up, and each of these wrappers adds some context to the message.
I also removed the developer vignette writing_bindings.Rmd, per this comment.
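
To illustrate the layering described above, here is a minimal base-R sketch of classed errors plus a wrapper that falls back only on "not supported" errors. It is not the arrow package's implementation (which builds on rlang/cli): the names raise_classed(), not_supported(), invalid_input(), try_with_fallback() and the sketch_* classes are all hypothetical.

```r
# Hypothetical helper: build and signal an error condition with a custom class.
raise_classed <- function(msg, class, call = NULL) {
  cond <- structure(
    class = c(class, "error", "condition"),
    list(message = msg, call = call)
  )
  stop(cond)
}

# Two wrappers, one per error class (names are assumptions for this sketch).
not_supported <- function(msg, call = NULL) {
  raise_classed(msg, "sketch_not_supported", call)
}
invalid_input <- function(msg, call = NULL) {
  raise_classed(msg, "sketch_validation_error", call)
}

# Run `arrow_path`; fall back to `r_path` only for "not supported" errors.
# Any other error, including validation errors, propagates unchanged.
try_with_fallback <- function(arrow_path, r_path) {
  tryCatch(
    arrow_path(),
    sketch_not_supported = function(e) {
      warning(conditionMessage(e), "; pulling data into R", call. = FALSE)
      r_path()
    }
  )
}
```

Because the handler is keyed on the class, a validation error never triggers the fallback, which is the behavior the "Before/After" case_when() example below relies on.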

The ultimate results of all of this:

We now don't tell people to collect() (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
Nicer error printing across the board, using rlang/cli for formatting, and cleaner calls and tracebacks. No more Error: Error : messages.
Adds the ability to provide helpful suggestions in error messages in bindings, for cases where there is an alternative available other than just collect(). In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
For us, it should be easier to work with arrow_eval() and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).
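
The "Call collect()" versus "Or, call collect()" rule could be sketched roughly as follows. This is an illustrative base-R helper under stated assumptions, not code from the PR: add_collect_hint() is a hypothetical name, and the body is assumed to be a named character vector whose names are bullet types, with ">" marking a suggestion.

```r
# Hypothetical helper: append the collect() hint to an error body.
# `body` is a named character vector; a ">" name marks a suggested alternative.
add_collect_hint <- function(body = character()) {
  has_suggestion <- ">" %in% names(body)
  hint <- if (has_suggestion) {
    # Alternatives were already suggested, so collect() is one more option.
    "Or, call collect() first to pull data into R."
  } else {
    "Call collect() first to pull data into R."
  }
  c(body, ">" = hint)
}

# Usage: a binding that already suggests an alternative.
add_collect_hint(c(">" = "Consider using a lubridate parsing function."))
```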

Some concrete examples:

Invalid input in a binding. Retry with dplyr won't help, so don't automatically do it (if Table) or suggest it (if Dataset).

# Before: 
mtcars |> 
  arrow_table() |> 
  transmute(case_when())
#> Warning: Expression case_when() not supported in Arrow; pulling data into R
#> Error:
#> ℹ In argument: `case_when()`.
#> Caused by error in `case_when()`:
#> ! At least one condition must be supplied.

# After:
mtcars |>
  arrow_table() |>
  transmute(case_when())
#> Error in `case_when()`:
#> ! No cases provided

Dealing with unsupported features outside of the bindings. This example is something that is checked in summarize() but not caught inside arrow_eval() because it's not about the expressions.

# Before:
mtcars |> 
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error: Error : .groups = "rowwise" not supported in Arrow
#> Call collect() first to pull data into R.

# After:
mtcars |>
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error in `summarise.arrow_dplyr_query()`:
#> ! .groups = "rowwise" not supported in Arrow
#> → Call collect() first to pull data into R.

When there are ways to solve the issue other than calling collect(), we give the user options:

# After:
mtcars |>
  InMemoryDataset$create() |> 
  transmute(date = as.Date(mpg, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")))
#> Error in `as.Date()`:
#> ! `as.Date()` with multiple `tryFormats` not supported in Arrow
#> → Consider using the lubridate specialised parsing functions `ymd()`, `ymd()`, etc.
#> → Or, call collect() first to pull data into R.

GitHub Issue: [R] Improve error handling in the dplyr NSE code #41834

github-actions · 2024-05-07T15:35:52Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

github-actions · 2024-05-26T19:59:42Z

⚠️ GitHub issue #41834 has been automatically assigned in GitHub to PR creator.

jonkeane · 2024-05-27T20:38:16Z

r/R/dplyr-across.R

-        abort("`...` argument to `across()` is deprecated in dplyr and not supported in Arrow")
+        arrow_not_supported(
+          "`...` argument to `across()` is deprecated in dplyr and",
+          body = c(">" = "Convert your call into a function or formula including the arguments"),


TIL about this making arrows!

jonkeane

Thanks for this! I'm really excited to see more helpful warnings.

I went through some of the code pretty thoroughly, but mostly skimmed the dplyr-{verb}.R files since those are all(?) indentation changes, yeah?

nealrichardson · 2024-05-27T21:47:31Z

r/R/dplyr-arrange.R

-    names(sorts)[i] <- format_expr(exprs[[i]])
-    if (inherits(sorts[[i]], "try-error")) {
-      msg <- paste("Expression", names(sorts)[i], "not supported in Arrow")
-      return(abandon_ship(call, .data, msg))


Here's an example of "not just an indentation change": in the new code, we don't have to evaluate, catch the error, and re-raise in abandon_ship(). We just let arrow_eval() raise, and try_arrow_dplyr() handles the abandon_ship() call.
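
As a rough sketch of the "let the evaluator raise" idea, an evaluator can attach the offending expression to the error and re-raise it with a class, so callers no longer inspect a "try-error" result. This is base R only and not the actual arrow_eval(): eval_with_call() and the sketch_* classes are hypothetical, and giving unclassed errors a fixed default class is an assumption (the PR description only says they are re-raised "as appropriate").

```r
# Hypothetical evaluator: evaluate `expr` and re-raise any error with the
# expression attached as its call. Errors that carry neither sketch class
# get `default_class` (an assumption made for this sketch).
eval_with_call <- function(expr, env = parent.frame(),
                           default_class = "sketch_validation_error") {
  force(env)
  known <- c("sketch_not_supported", "sketch_validation_error")
  tryCatch(
    eval(expr, env),
    error = function(e) {
      e$call <- expr # report the user's expression, not internal frames
      if (!inherits(e, known)) {
        class(e) <- c(default_class, class(e))
      }
      stop(e) # re-raise; an outer wrapper decides whether to fall back
    }
  )
}
```

A verb implementation can then call the evaluator directly and leave the fallback decision to a single outer handler.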

nealrichardson · 2024-05-27T21:49:52Z

r/R/dplyr-mutate.R

+        !is.null(results[[new_var]])) {
+        # We need some wrapping to handle literal values
+        if (length(results[[new_var]]) != 1) {
+          arrow_not_supported("Recycling values of length != 1", call = exprs[[i]])


Here's another not-just-indentation change: for validations/errors outside of arrow_eval, we just raise arrow_not_supported or validation_error like in the function bindings.

nealrichardson · 2024-05-27T21:51:48Z

> I went through some of the code pretty thoroughly, but mostly skimmed the dplyr-{verb}.R files since those are all(?) indentation changes, yeah?

Mostly. I just went back and commented on the PR in a couple places that show some of the non-indentation changes.

nealrichardson · 2024-05-27T23:54:43Z

I should probably go and add some sentences to the writing_bindings.Rmd vignette, at least to x-ref to the man page I added for the new errors.

thisisnic · 2024-05-28T04:32:10Z

> I should probably go and add some sentences to the writing_bindings.Rmd vignette, at least to x-ref to the man page I added for the new errors.

Do we definitely still want/need that article? I am a big +1 to removing redundant docs/code, and given that it's buried in the developer docs and it's not likely there'll be a ton of new Acero functions, we could, like, just delete it?

thisisnic

Always a fan of UX changes like this, and I love the usage of → to suggest a concrete action. Out of curiosity, is this something we're emulating from somewhere else or something you came up with on this PR @nealrichardson?

nealrichardson · 2024-05-28T13:34:21Z

> Do we definitely still want/need that article? I am a big +1 to removing redundant docs/code, and given that it's buried in the developer docs and it's not likely there'll be a ton of new Acero functions, we could, like, just delete it?

I'm cool with deleting it. You're right that it's from another era in the package's development. And if someone is going to add more bindings, there's hundreds of examples to copy now.

> Always a fan of UX changes like this, and I love the usage of → to suggest a concrete action. Out of curiosity, is this something we're emulating from somewhere else or something you came up with on this PR @nealrichardson?

I guess I came up with it. Looking at the options in cli (https://cli.r-lib.org/reference/cli_bullets.html), I wanted to reserve the "i" (information) bullets for clarifying details, the others didn't seem appropriate, and, well, arrow just seemed like a logical choice for this package :)

conbench-apache-arrow · 2024-06-06T13:34:51Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 774ee0f.

There were 8 benchmark results indicating a performance regression:

Commit Run on test-mac-arm at 2024-05-29 18:14:19Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-15, scale_factor=1
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-18, scale_factor=1
and 6 more (see the report linked below)

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

Necessary for a clean check. These were inadvertently taken out in #41576; restoring them doesn't actually change any code, it just appeases the static checker that CRAN runs.

Authored-by: Jonathan Keane <[email protected]>
Signed-off-by: Jonathan Keane <[email protected]>

### Rationale for this change

The writing-bindings vignette was removed in #41576 (comment). It turns out there were more references to it throughout the docs that I failed to remove.

### What changes are included in this PR?

Deleting x-refs that don't exist anymore.

### Are these changes tested?

Not really.

### Are there any user-facing changes?

The docs won't point you at links that 404.

* GitHub Issue: #43665

github-actions bot added Component: R Component: Documentation awaiting review Awaiting review labels May 7, 2024

nealrichardson added 9 commits May 26, 2024 15:33

arrow_not_supported() raises a classed error

5166cf7

Better distinguish invalid from not supported; add try_arrow_dplyr wrapping; make arrow_eval error

4e2c735

More classed error raising

fefbc0c

Use cli formatting so we can add bullets; wrap more in try_arrow_dplyr

be79caa

Rename error functions. Start updating tests

66ff9c0

Handle assert_that and match.arg; implement alternative suggestions; add docs

670e9e5

Add some tests, fix some tests

0bda4e3

More test updating

36bf23a

Add more direct tests of dplyr-eval; update remaining expectations

0cd2ff3

nealrichardson force-pushed the better-errors branch from 178648a to 0cd2ff3 Compare May 26, 2024 19:34

nealrichardson marked this pull request as ready for review May 26, 2024 19:34

nealrichardson requested review from paleolimbot and thisisnic as code owners May 26, 2024 19:34

Add some cases with alternatives suggested

df9d081

nealrichardson mentioned this pull request May 26, 2024

[R] Improve error handling in the dplyr NSE code #41834

Closed

nealrichardson changed the title ~~WIP [R] Better error handling in dplyr code~~ GH-41834: [R] Better error handling in dplyr code May 26, 2024

nealrichardson requested a review from jonkeane May 26, 2024 19:59

Tidy up match.call

4df3a7d

jonkeane reviewed May 27, 2024

github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 27, 2024

jonkeane approved these changes May 27, 2024

github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels May 27, 2024

nealrichardson commented May 27, 2024

thisisnic approved these changes May 28, 2024

Remove writing_bindings.Rmd

20e218c

nealrichardson merged commit 774ee0f into apache:main May 29, 2024
11 of 12 checks passed

nealrichardson deleted the better-errors branch May 29, 2024 15:37

nealrichardson mentioned this pull request May 29, 2024

MINOR: [R] Remove writing_bindings from _pkgdown.yml #41877

Merged

nealrichardson added a commit that referenced this pull request May 30, 2024

MINOR: [R] Remove writing_bindings from _pkgdown.yml (#41877)

6800be9

### Rationale for this change

Missed this in #41576.

### Are these changes tested?

We should make sure.

### Are there any user-facing changes?

No.

jonkeane mentioned this pull request Jul 20, 2024

MINOR: [R] add back dplyr:: to left_join calls #43348

Merged

This was referenced Aug 9, 2024

[R] Possible regression in dev arrow #43627

Closed

GH-43627: [R] Fix summarize() performance regression (pushdown) #43649

Merged

nealrichardson mentioned this pull request Aug 29, 2024

GH-43665: [R] Remove references to bindings vignette #43889

Merged
